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Abstract - We consider the problem of estimating the principal components of a population 
covariance matrix from a limited number of measurement data. Using a combination of random 
matrix and information-theoretic tools, we show that all the eigenmodes of the sample correlation 
matrices are informative, and not only the top ones. We show how this information can be 
exploited when prior information about the principal component, such as whether it is localized 
or not, is available by mapping the estimation problem onto the search for the ground state of 
a spin-glass-like effective Hamiltonian encoding the prior. Results are illustrated numerically on 
the spiked covariance model. 
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Introduction. — The availability of large-scale mea- 
^ surements of complex systems, such as in biology, finance, 
^sociology, ... calls for new methods to extract information 
-4xom those data. Of crucial importance is the characteriza¬ 


<N 


tion of the correlation structure of the data, which reflects 


^the underlying interaction network between the system 
components. A widely-used technique is principal compo- 
OO nent analysis (PCA), which retains only the components 
CN corresponding to the largest eigenvalues of the empirical 
correlation matrix computed from the data, considered as 
^"Amost informative. PCA applications range from computer 
CQ vision [1] to finance [2], to neuroscience [3] and many oth- 


m 


o3 


ers. PCA can, however, be inefficient in some cases [4], in 
particular when the number T of available data is compa¬ 
rable to the number N of system components, a situation 
referred to as high-dimensional data analysis [5]. 

In this Letter we focus on one aspect of this question, 
^namely, how to estimate the main eigenmode(s) of the 
‘true’ system correlation matrix at large N/T ratio. We 
show that considering only the main components of the 
empirical correlation matrix, as PCA does, is generally 
not optimal, and that taking into account the eigenmodes 
associated to the low eigenvalues can greatly improved the 
quality of the predictions. 

To fix notations let us consider a collection of N Gaus¬ 
sian random variables Xi (i = 1,..., N ), with zero means 



Fig. 1: Eigenvectors Cm of the sample correlation matrix and 
‘true’ component Ci in dimension N = 3. If the statistics is 
sufficient, i.e. r is low enough, the overlap w\ — Ci -Ci is finite 
in the large N limit, while the overlaps Wm = Ci -Cm (with 
m > 2) vanish as 0(l/y/N). 


and (population) covariances Cij. Assume we have ob¬ 
served T independent realizations of those variables, which 
define the N x T-dimensional rectangular matrix X , e.g. 
X % : 2 corresponds to the second observation of variable X 3 . 
The empirical covariance matrix, C = ^X • X\ also called 
sample covariance matrix is an estimator of the popula¬ 
tion covariance matrix C. Let us call Cm, m = 1,2,..., TV, 
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the normalized eigenvectors of C, corresponding to eigen¬ 
values Ai > A 2 > ... > A m . Similarly we call £ m and A m 
the normalized eigenvectors and eigenvalues (ranked in de¬ 
creasing order) of C. Suppose now that N is kept fixed and 
the number of observations T is increased. A perfect sam¬ 
pling condition corresponds to the limit of infinite mea¬ 
surements (T —)> 00 ), where the matrix estimator C ap¬ 
proaches to the true covariance C. This scenario changes 
if the number of variables N increases at the same pace 
as the number of measures T, with a fixed ratio r = N/T, 
hereafter called sampling noise. In that case, the estima¬ 
tor C differs from the true covariance matrix since it is 
affected by finite sampling effects, a common situation in 
experiments. Perfect sampling is recovered in the limit 
case r —)> 0 . 

While the typical [ 6 ] properties of the distribution of 
the eigenvalues of C are well characterized (and theoreti¬ 
cal results for rare events are available in the case of purely 
uncorrelated variables [7,8]), much less is known about its 
eigenvectors, see [9] or [10] and references therein. We 
want to estimate the top component £ 1 . A simple estima¬ 
tor is provided by £ 1 , which is naturally expected to be 
exact when r —» 0 (in the absence of eigenvalue degener¬ 
acy), see Fig. 1. For finite r, however, £1 is generally not a 
perfect estimator, and we show below how the knowledge 
of the other empirical eigenvectors £ m , with m > 2 , may 
considerably help to improve the estimate of £ 1 . 

Let us consider the scalar products w m = £i*£ m (Fig. 1 ). 
Those overlaps are stochastic variables with zero mean, 
and variances where (•) denotes the average over 

X. Each eigenvector £ m taken individually is very weakly 
informative about £1 as the overlap vanishes as TV -1 / 1 2 . On 
the contrary we show below that the mutual information 
between an extensive (of the order of N) number of eigen¬ 
vectors £ m (ra > 2 ) and £1 remains finite when N 00 . 
We then present one application where the knowledge of 
the overlaps w m helps us to improve our prediction of 
the top component £1 in the presence of prior information 
about this vector (here, that it has “large” components). 


where A±(r) = (1 =b yjr) 2 are the edges of the distribu¬ 
tion h For ‘strong’ signal, 7 > 7 c (r), the spectrum p is 
equal to the MP spectrum, and includes one eigenvalue, 
isolated from the MP bulk and centered in 

7 ( 7 ) = 7 + r — t • ( 2 ) 

7-1 

The onset of a signal-related eigenvalue 7 at the critical 
value of 7 = 7 c (r) was first reported in [ 12 ] and mathe¬ 
matically proven in [13]. Similar ‘retarded learning’ tran¬ 
sitions are encountered in models of neural networks [14] 
and in the Gaussian Matrix ensemble [15]. The transition 
is pictorially represented in Fig. 2. 



Fig. 2: The spiked covariance model below (left, 7 < j c (r)) 
and above (right, 7 > 7 c (r*)). The eigenvalue spectrum given 
by the Marchenko-Pastur distribution in Eq. (1) is shown in 
red; for 7 > j c (r), the signal eigenvalue 7 ( 7 ), see Eq. (2), is 
represented by a vertical red line. The squared overlap function 
W 2 ( 7 , A) in Eq. (5) is shown in black over the interval [A_, A+]; 
the vertical dashed line locates the edge A+. Note that W 2 is 
rescaled by a factor 0.1 to fit in the figure. 


We define IF 2 (A,A) as the mean squared overlap, mul¬ 
tiplied by N , between the eigenvectors of C and C associ¬ 
ated to, respectively, the eigenvalues A and A, namely: 


tt 2 (a,a) = E 

£,m 


{(Hc£m) 2 S(\-\ e )5(\-\m)) 

A^(A)p(A) 


(3) 


The spiked covariance model. — We will illustrate 
our approach on the spiked covariance model, a popular 
model in random matrix theory, in which all the eigenval¬ 
ues of C, but one, say, Ai = 7 , are equal to unity. Eigen¬ 
value 7 represents the ‘signal’, with its associated eigen¬ 
vector £ 1 . We consider the case 7 > 1 below, but similar 

results are found for 7 < 1 . Let p( A) = — E^b- A -)) 

m 

be the average density of eigenvalues of C and p( A) the 
density of eigenvalues of C. 

For ‘weak’ signals, 7 < y c (r) = 1 + yT, p coincides with 
the spectrum of the covariance matrix of N independent 
variables, the so-called Marchenko-Pastur (MP) distribu¬ 
tion [ 11 ], defined as 


Pmp (^) 


Aa+-a)(a-a_) 

2tt\ r 


(i) 


The mean squared scalar products with the top compo¬ 
nent introduced above are given by 

(w 2 m ) = ±W 2 ( 7 , Am). (4) 

We will therefore fix A = 7 in the following. 

W 2 can be computed using statistical physics ap¬ 
proaches to random matrix theory [16, 17]. For A E 
[A_(r); A+(r)] spanning the MP bulk spectrum one 
has [ 20 ] 

W 2 (j,X) = — . (5) 

A — 7(7) 

W 2 in Eq. (5) is an increasing function of A, which diverges 
for A = 7 ( 7 ). Note that this divergence is always located 

1 We consider here the case r < 1. For r > la (5-peak in A = 0 of 

mass 1 — - is present. 


P-2 



















Estimating the principal components of correlation matrices from all their empirical eigenvectors 


outside the MP spectrum (as in both panels of Fig. 2 ), and 
coincides with the MP edge A + (r) for the critical signal 
eigenvalue 7 = 7 c (r) only. 

It is easy to obtain the expression for the overlap be¬ 
tween the top eigenvectors of C and C by exploiting the 
standard relation for orthonormal bases, Ylm w rn = 1, 01* 
its continuous version in the infinite -N limit: 


r 


W\ 1 ,\)p MP (X)d\ + {wl) = l. 


(6) 


We deduce from Eqs. ( 1 , 5 , 6 ) that the mean squared over¬ 
lap between the top components of the population and 
sample covariance matrices is given by 


H) 


0 

7 2 — 7 ( 7 ) 

7 ( 7 ) ( 7 - 1 ) 


7 < 7cW> 
7 > 7c M • 


(7) 


and is non-zero in the strong signal regime only. This 
result agrees with previous applications, e.g. to signal 
detection [ 18 ], or to the inference of relevant modes in 
inverse problems [ 19 ]. 


Mutual information between the empirical 
eigenvectors and the top component £1. — As 

a consequence, close to the transition point (both from 
above and from below) the mean squared overlap W 2 ( 7, A) 
has a strong asymmetric shape (see black curves in Fig. 2 ) 
showing how the sample eigenvectors close to the right 
edge are highly correlated to the top component. This 
correlation is informative and can be exploited to infer 
the principal component with appropriate algorithms, as 
we show in the next section. 

To quantify this information about the component £1 
contained in the eigenvectors we introduce the mutual 
information I between those variables. I is defined as the 
difference between the entropy of variable £1 and the en¬ 
tropy of £ conditioned to {4™}; it measures how much the 
knowledge of {£ m } reduces the uncertainty on the esti¬ 
mate of £. I is non-negative and vanishes if and only if £ 
and {4m} are independent [21]. 

Our goal is to understand whether the knowledge of 
many (of the order of N ) very small (of the order of 
1 /y/N) overlaps with the empirical eigenvectors is help¬ 
ful to determine the top component, i.e. if I has a finite 
limite in the N —>> 00 limit. As the exact computation 
of I is hard, we assume that the correlations between the 
overlaps are negligible for large N. Within this assump¬ 
tion, we calculate the mutual information I between £1 
and {£ m ; 2 < m < / TV}, with / < 1 is the fraction of 
retained empirical eigenvectors. The details of the calcu¬ 
lation, based on the use of the replica approach, are given 
in Appendix. Within the replica symmetric framework, 
we obtain 


^(Ci’ {^ m }) 


+ 



1 

= — min 

2 q,q 


(q - Q(f))q + log(l - q) 


dX / 3 (A) log(l + q W 2 (7, A)) , 


( 8 ) 



Fig. 3: Mutual information /(£ 1 , {4m}), Eq. (8) for the spiked 
covariance model with r — 0.5, divided by the number N of 
variables. Qualitatively similar curves are obtained when r is 
varied. The vertical dotted line corresponds to the transition 
point, below which the top empirical eigenvector is completely 
uncorrelated with the true (population) one, as entailed by 
Eq. (7) 

. Remarkably, the mutual information, / in Eq. (8), 
between an extensive number of empirical eigenvectors 
corresponding to lower eigenvalues and the true top 
components is finite positive even in this low sampling 
regime. 


where A(/) is such that / = dXp(X) and Q(/) = 

dA/ 5 (A) IE 2 (7, A). The mutual information is plot¬ 
ted in Fig. 3 as a function of 7, and for various values of 
/. It is strictly positive for all values of 7 ■ -^"(Cl? {Cm}) 
reaches a maximum at the transition point y c (r), separat¬ 
ing the weak and strong signal regimes. Moreover it is 
increasing with the fraction /. Our calculation therefore 
provides clear evidence for the fact that sample eigenvec¬ 
tors in the bulk of the MP spectrum are informative about 
the principal component of the population covariance ma¬ 
trix. This result is remarkable especially in the case of 
weak signals, where the top empirical eigenvector is not 
informative at all. 

Inference of principal component with prior 
knowledge. — Based on the study of the overlaps w m 
above we may express the principal component £1 as a 
weighted sum of the sample eigenvectors, 

Cl = Y (^1) Cl + 'y ^ O'm \/{ w m) Cm 5 

m =2 

where the cr m ’s are independent Gaussian variables of zero 
means, and unit variances (m > 2 ). Equation ( 9 ) implic¬ 
itly defines the likelihood of the component £1 given the 
sample eigenvectors. The rationale underlying Eq. ( 9 ) is 
that it gives back the right statistics for the scalar prod¬ 
ucts w m . In particular, the expected value (over the cr m ’s) 
of Ci ' Cm vanishes, while the average value of (£i • £ m ) 2 
coincides with (w^). 
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In the absence of any prior information the average 
value of £1 is simply equal to yj (w\) £ 1 , corresponding to 
the standard estimate widely used in literature. Actually 
this estimate discards the information contained in the 
sample eigenvectors £ m , and vanishes in the weak signal 
regime. Our purpose here is to improve over this simple 
estimate, by exploiting some prior information over the 
top component. 



y/y c 


Fig. 4: Overlap of £1 = (1, 0,..., 0) with the ground state %[ cGS} 
of —IPR (red circles) and with the sample top component £i 
(black squares) as a function of 7 for the spiked covariance 
model with r = 0.5 (y c ~ 1.7071) and N = 80 variables. Inset: 
zoom of the region slightly below 7 C ; overlap of £1 with the 
ground state of —IPR for different sizes A. All lines seem to 
intersect around 7/7 c ~ 0.83 in the poor sampling phase. 

In many practical applications, indeed, prior knowledge 
over the principal components is available, such as the en¬ 
tries of those components are positive, sparse, bounded 
from above, etc ... A physically-sound prior knowledge 
we consider hereafter is the localization of principal com¬ 
ponents, found to be important for the identification of 
site in contacts on the three-dimensional structure of 
proteins [24], or in the study of phonons in liquid crys¬ 
tals [25,26]. Drawing our inspiration from condensed mat¬ 
ter physics we consider the inverse participation ratio 

N 

IPR(^) = ]T[(£7] 4 , (10) 

i=1 

and look for estimates of the principal component with 
large IPR. This prior favors vectors with large entries, i.e. 
non vanishing in the large-A" limit. More precisely the ob¬ 
jective function to be maximized is the log-posterior distri¬ 
bution of £ 1 , which sums the IPR in Eq. (10) and the log- 
likelihood implicitly defined by Eq. (9). To simplify this 
computational problem we consider a discrete version of 
Eq. (9), where the cr m ’s are constrained to take ±1 values. 
As a result we have at our disposal a pool of 2 Ar_1 candi¬ 
date components for £1 with equal log-likelihoods. We can 
then simply look for the binary configuration {cr m ; m > 2} 
in this pool, which maximizes the IPR. 

To find the ground state of —IPR, denoted by £i GS) , 
we resort to simulated annealing with a Monte Carlo 
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Fig. 5: Same simulation as in Fig. 4 with a different top eigen¬ 
vector (£i)z oc w here the reconstruction is shown for differ¬ 
ent values of fi. Remarkably the second transition is present 
also in this case (with a threshold depending on fi), as shown 
in the insets for different values of A (A = 20, 40, 80). 


scheme 2 . The overlap between £{ GS) and £1 is shown for 
different sizes A in Fig. 4 . We find a better (higher) value 
than the one corresponding to the naive estimate based on 
£1. The improvement is maximal for 7 close to y c (r), that 
is, in the critical region separating weak from strong sig¬ 
nals. Remarkably, while the naive estimate breaks down 
for 7 < 7 c (r) regime in the large A limit this does not 
seem to be the case for our procedure. This result is in 
agreement with the prediction given by the Mutual In¬ 
formation (Eq. (8)) in Fig. 3 , which is positive even for 
low signals and reaches its maximum in correspondence of 
the transition. Therefore, the contribution of lower eigen¬ 
vectors in the estimation is prominent in this region, as 
expected. 

This scenario does not qualitatively depend on the 
choice of the eigenvector £1. In particular we have stud¬ 
ied both the cases of a very sparse eigenvector (see Fig. 4 , 
where £1 = (1, 0,..., 0)) and of a slow, power-law decay, 
(£i)i in with fi > As shown in Fig. 5 , our procedure 
results in a better prediction of the top eigenvector for all 
the values of /a we have tested. We stress again that the 
estimator in Eq. ( 9 ) exploits the information contained in 
the lower eigenmodes. A random search around £1, for in¬ 
stance by considering an estimator such as a £i+a/ 1 — a 2 77, 
where a = yj(w\) and 77 is a random vector orthogonal to 
£1, would not produce comparable results, especially in 
the weak signal region. 

Transition in prior knowledge-based inference. — 

As reported above, the insets of Fig. 4 and Fig. 5 suggest 
the existence of a transition point, well inside the weak 
signal regime, above which £1 may be approximately in- 

2 The inverse temperature (3 is slowly incremented by steps of 

6/3 = 5, from f3 = 50 to (3 = 200. For each temperature we at¬ 
tempt 20 iV Monte Carlo spin flips. At the end of this procedure, 
the configuration with lowest energy is retained. Average is taken 
over ~ 10 3 realizations of the measurement matrix X. 
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ferred with the help of prior knowledge, even for large 
system sizes. This transition bears a strong analogy with 
transitions taking place in the Hopfield model below the 
critical capacity. Indeed, the IPR in Eq. (10), once ex¬ 
pressed in terms of the cr m ’s, may be interpreted as minus 
the Hamiltonian of an effective spin system, with a mix¬ 
ture of k- body interactions, where k ranges from 1 to 4. 
The interactions 

N 4 

Jmi,1712,1713,rri 4 = y j \j( W rni) £i,rri£ (H) 

i=1 1=1 

are non-linear combinations of the eigenmodes compo¬ 
nents, and may have positive or negative signs. This spin- 
glass Hamiltonian is strongly reminiscent of the Hopfield 
model [27]. Each entry (£i)i, i = 1, ..., V, of the principal 
component may be interpreted as the ‘magnetization’ Mi 
of the spin configuration a along the ‘pattern’ i, whose 
m th entry is 7(«4)(£m)i, or, equivalently, 

N 

Mi(cr) = (12) 

m= 1 

with <ji = 1. Note that our Hamiltonian, 

N 

-IPR(6(<t)) = £m*(<7) 4 , (13) 

is however quartic, and not quadratic in the magnetiza¬ 
tions. Thanks to this analogy, the transition observed 
here can be interpreted in terms of the phase diagram of 
the Hopfield model [28]: At low temperatures and inter¬ 
mediate loads (comprised between ~ 0.05 and the critical 
capacity ~ 0.14) the patterns to be stored are local minima 
of the Hopfield Hamiltonian, and are uncorrelated with the 
ground state. This scenario holds in our case too. We show 
in Fig. 6 the distribution of the energies vs. the overlap, 
obtained through exhaustive searches of the configuration 
space for small sizes N < 25. Below the transition point, 
the ground-state vector £[ GS) is clearly not aligned along 

Cl. 

Conclusions. — As described by random matrix the¬ 
ory, the overlap between lower empirical eigenvectors and 
the top true one is very small, of the order of TV -1 / 2 and 
vanishes in the infinite N limit. In spite of this, in this 
paper we have shown how an extensive number of sample 
eigenvectors with low eigenvalues is strongly informative 
about the population principal components, by presenting 
a calculation of mutual information for the spike covari¬ 
ance model. 

Based on this result, we have introduced a general pro¬ 
cedure to exploit that information in the presence of prior 
knowledge, by mapping the inference problem onto the 
search for the the ground state of a spin-glass-like Hamil¬ 
tonian encoding the prior. We have shown the efficiency 
of the approach when one knows a priori that top compo¬ 
nents have large entries, which considerably improves the 
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Fig. 6: Typical distributions of energies of the states cr vs. the 
overlaps with the true eigenvector, above (top) and below (bot¬ 
tom panels) the transition value. The ground state is shown 
by the blue circle. Same model as in Fig. 4, with N = 15 and 
r — 0.5. For large N we expect the transition to take place at 
7/7 c — 0.83. 

standard inference and allows us to recover the compo¬ 
nent in the weak signal region, where naive inference fails. 
This finding agrees with recent results on non-negative 
PC A [29]. 

It would be interesting to understand how efficient is 
our procedure for other priors, or in cases where the value 
of the overlap distribution is unknown and eigenvalue¬ 
cleaning techniques [20,30]) must be used for its estima¬ 
tion. A limitation of our approach is the use of discrete 
(binary) variables a m in Eq. (9). In a forthcoming publi¬ 
cation we plan to study more refined algorithms by con¬ 
sidering continuous variables instead, in order to test the 
generality of the (second) transition found in the weak 
signal regime. 

We stress that our approach could also be extended 
to infer more than one principal components. While the 
case of a finite number of separated eigenvalues (multiple- 
spiked covariance model) is straightforward, it would be 
interesting to consider O(N )-dimensional degenerate sub¬ 
spaces, as in [31]. 
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Appendix: replica calculation of the mutual in¬ 
formation. — Here, we present the derivation of the 
mutual information Eq. (8). We assume that the com¬ 
ponents of are independent and identically Gaussianly 
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distributed, with zero means and unit variances, and that 
the dot products w m are also independent normal vari¬ 
ables with zero means and variances W^/N. The index 
m runs from 2 to Mi /iV, where / is the fraction of 
eigenvectors retained. Equation (9) thus fully defines the 
joint distribution of the eigenvectors, P[£ i, {£ m }]. We de¬ 
fine Z{n) = Ml Tim dim Ml,{&*}]"• M) is obvi- 
ously equal to unity by normalisation of P, and the mu¬ 
tual information I is simply related to the derivative of 
Z in n = 1, see below. Along the lines of the replica 
method, we will first consider that n is an integer, and 
will next perform an analytic continuation to real-valued 
n on the outcome. After integrating out the eigenvector 
components and within the replica symmetric hypothesis, 
we obtain that F n = log Z(n) is given, up to some ir¬ 
relevant additive constant, by the saddle-point value of 


n(n — 1) 


QS - - qs 


(14) 


2 l0g 


n— 1 




q-q 


- ^E l0 S( 1 + W m( S -*)) 

m 

1 i f us 

- 2»^ l0g ( 1_ I 


+ ltZ(s-3) 


over its four arguments q , g, s, s. The parameters g, q are 
equal to, respectively, where the 

overbears denotes the averages over the eigenvectors and 
the brackets denotes the averages over the Gaussian mea¬ 
sure over the overlaps at fixed eigenvectors; s, s are the 
conjugated Lagrange parameters. 

A straightforward calculation shows that the mutual 
information I in Eq. (8) is given by , {£™}) = 

— I ,. We are therefore left with the resolution 
of the saddle-point equations over g, g, s, and s. First, 
let us remark that Pi depends on g and s only: Pi = 

— ^2 - Tv lQ g (l — s W^). The values of the order 
parameters at the saddle point are thus g = 

5 = 0. We next consider F n= i +e = Pi — el + 0(e 2 ), where 


1 = f + 

m 

- ^E^ + ^ 1 °g( 1 - < ?) • ( 15 ) 

m 


To the lowest order in e, the values of g and s are un¬ 
changed with respect to the case n = 1. The saddle-point 

equations for q and s give s = = -§ jFw % ' 

The corresponding expression for I is given in Eq. (8). 
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